excess risk excess risk
Nearly Optimal Differentially Private ReLU Regression
Ding, Meng, Lei, Mingxi, Wang, Shaowei, Zheng, Tianhang, Wang, Di, Xu, Jinhui
In this paper, we investigate one of the most fundamental nonconvex learning problems, ReLU regression, in the Differential Privacy (DP) model. Previous studies on private ReLU regression heavily rely on stringent assumptions, such as constant bounded norms for feature vectors and labels. We relax these assumptions to a more standard setting, where data can be i.i.d. sampled from $O(1)$-sub-Gaussian distributions. We first show that when $\varepsilon = \tilde{O}(\sqrt{\frac{1}{N}})$ and there is some public data, it is possible to achieve an upper bound of $\Tilde{O}(\frac{d^2}{N^2 \varepsilon^2})$ for the excess population risk in $(\epsilon, \delta)$-DP, where $d$ is the dimension and $N$ is the number of data samples. Moreover, we relax the requirement of $\epsilon$ and public data by proposing and analyzing a one-pass mini-batch Generalized Linear Model Perceptron algorithm (DP-MBGLMtron). Additionally, using the tracing attack argument technique, we demonstrate that the minimax rate of the estimation error for $(\varepsilon, \delta)$-DP algorithms is lower bounded by $\Omega(\frac{d^2}{N^2 \varepsilon^2})$. This shows that DP-MBGLMtron achieves the optimal utility bound up to logarithmic factors. Experiments further support our theoretical results.
- North America > United States > California (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)
FIRAL: An Active Learning Algorithm for Multinomial Logistic Regression
We investigate theory and algorithms for pool-based active learning for multiclass classification using multinomial logistic regression. Using finite sample analysis, we prove that the Fisher Information Ratio (FIR) lower and upper bounds the excess risk. Based on our theoretical analysis, we propose an active learning algorithm that employs regret minimization to minimize the FIR. To verify our derived excess risk bounds, we conduct experiments on synthetic datasets. Furthermore, we compare FIRAL with five other methods and found that our scheme outperforms them: it consistently produces the smallest classification error in the multiclass logistic regression setting, as demonstrated through experiments on MNIST, CIFAR-10, and 50-class ImageNet.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression
Chen, Xingwu, Zhao, Lei, Zou, Difan
Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is sufficient for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.
- Asia > China > Hong Kong (0.04)
- North America > United States > California > Orange County > Irvine (0.04)
Selective Nonparametric Regression via Testing
Noskov, Fedor, Fishkov, Alexander, Panov, Maxim
Prediction with the possibility of abstention (or selective prediction) is an important problem for error-critical machine learning applications. While well-studied in the classification setup, selective approaches to regression are much less developed. In this work, we consider the nonparametric heteroskedastic regression problem and develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point. Unlike existing methods, the proposed one allows to account not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor. We prove non-asymptotic bounds on the risk of the resulting estimator and show the existence of several different convergence regimes. Theoretical analysis is illustrated with a series of experiments on simulated and real-world data.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (4 more...)
High-dimensional analysis of double descent for linear regression with random projections
Over-parameterized models estimated with some form of gradient descent come in various forms, such as linear regression with potentially non-linear features, neural networks, or kernel methods. The double descent phenomenon can be seen empirically in several of these models [6, 15]: Given a fixed prediction problem, when the number of parameters of the model is increasing from zero to the number of observations, the generalization performance traditionally goes down and then up, due to overfitting. Once the number of parameters exceeds the number of observations, the generalization error decreases again, as illustrated in Figure 1. The phenomenon has been theoretically analyzed in several settings, such as random features based on neural networks [27], random Fourier features [24], or linear regression [7, 17]. While the analysis of [27, 24] for random features corresponds to a single prediction problem with a sequence of increasingly larger prediction models, most of the analysis of [17] for linear regression does not consider a single problem, but varying problems, which does not actually lead to a double descent curve. Random subsampling on a single prediction problem was analyzed with a simpler model with isotropic covariance matrices in [7] and [17, Section 5.2], but without a proper double descent as the model is too simple to account for a U-shaped curve in the under-parameterized regime. In work related to ours, principal component regression was analyzed by [37] with a double descent curve but with less general assumptions regarding the spectrum of the covariance matrix and the optimal predictor.
Relaxing the Feature Covariance Assumption: Time-Variant Bounds for Benign Overfitting in Linear Regression
Xu, Jing, Teng, Jiaye, Yao, Andrew Chi-Chih
Benign overfitting demonstrates that overparameterized models can perform well on test data while fitting noisy training data. However, it only considers the final min-norm solution in linear regression, which ignores the algorithm information and the corresponding training procedure. In this paper, we generalize the idea of benign overfitting to the whole training trajectory instead of the min-norm solution and derive a time-variant bound based on the trajectory analysis. Starting from the time-variant bound, we further derive a time interval that suffices to guarantee a consistent generalization error for a given feature covariance. Unlike existing approaches, the newly proposed generalization bound is characterized by a time-variant effective dimension of feature covariance. By introducing the time factor, we relax the strict assumption on the feature covariance matrix required in previous benign overfitting under the regimes of overparameterized linear regression with gradient descent. This paper extends the scope of benign overfitting, and experiment results indicate that the proposed bound accords better with empirical evidence.
- North America > United States > California > Los Angeles County > Long Beach (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (13 more...)